{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 梯度下降\n",
    ":label:`sec_gd`\n",
    "\n",
    "尽管*梯度下降*（gradient descent）很少直接用于深度学习，\n",
    "但了解它是理解下一节随机梯度下降算法的关键。\n",
    "例如，由于学习率过大，优化问题可能会发散，这种现象早已在梯度下降中出现。\n",
    "同样地，*预处理*（preconditioning）是梯度下降中的一种常用技术，\n",
    "还被沿用到更高级的算法中。\n",
    "让我们从简单的一维梯度下降开始。\n",
    "\n",
    "## 一维梯度下降\n",
    "\n",
    "为什么梯度下降算法可以优化目标函数？\n",
    "一维中的梯度下降给我们很好的启发。\n",
    "考虑一类连续可微实值函数$f: \\mathbb{R} \\rightarrow \\mathbb{R}$，\n",
    "利用泰勒展开，我们可以得到\n",
    "\n",
    "$$f(x + \\epsilon) = f(x) + \\epsilon f'(x) + \\mathcal{O}(\\epsilon^2).$$\n",
    ":eqlabel:`gd-taylor`\n",
    "\n",
    "即在一阶近似中，$f(x+\\epsilon)$可通过$x$处的函数值$f(x)$和一阶导数$f'(x)$得出。\n",
    "我们可以假设在负梯度方向上移动的$\\epsilon$会减少$f$。\n",
    "为了简单起见，我们选择固定步长$\\eta > 0$，然后取$\\epsilon = -\\eta f'(x)$。\n",
    "将其代入泰勒展开式我们可以得到\n",
    "\n",
    "$$f(x - \\eta f'(x)) = f(x) - \\eta f'^2(x) + \\mathcal{O}(\\eta^2 f'^2(x)).$$\n",
    ":eqlabel:`gd-taylor-2`\n",
    "\n",
    "如果其导数$f'(x) \\neq 0$没有消失，我们就能继续展开，这是因为$\\eta f'^2(x)>0$。\n",
    "此外，我们总是可以令$\\eta$小到足以使高阶项变得不相关。\n",
    "因此，\n",
    "\n",
    "$$f(x - \\eta f'(x)) \\lessapprox f(x).$$\n",
    "\n",
    "这意味着，如果我们使用\n",
    "\n",
    "$$x \\leftarrow x - \\eta f'(x)$$\n",
    "\n",
    "来迭代$x$，函数$f(x)$的值可能会下降。\n",
    "因此，在梯度下降中，我们首先选择初始值$x$和常数$\\eta > 0$，\n",
    "然后使用它们连续迭代$x$，直到停止条件达成。\n",
    "例如，当梯度$|f'(x)|$的幅度足够小或迭代次数达到某个值时。\n",
    "\n",
    "下面我们来展示如何实现梯度下降。为了简单起见，我们选用目标函数$f(x)=x^2$。\n",
    "尽管我们知道$x=0$时$f(x)$能取得最小值，\n",
    "但我们仍然使用这个简单的函数来观察$x$的变化。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load ../utils/djl-imports\n",
    "%load ../utils/plot-utils\n",
    "%load ../utils/Functions.java"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Function<Float, Float> f = x -> x * x; // Objective Function\n",
    "Function<Float, Float> gradf = x -> 2 * x; // Its Derivative\n",
    "\n",
    "NDManager manager = NDManager.newBaseManager();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，我们使用$x=10$作为初始值，并假设$\\eta=0.2$。\n",
    "使用梯度下降法迭代$x$共10次，我们可以看到，$x$的值最终将接近最优解。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "public float[] gd(float eta) {\n",
    "    float x = 10f;\n",
    "    float[] results = new float[11];\n",
    "    results[0] = x;\n",
    "    \n",
    "    for (int i = 0; i < 10; i++) {\n",
    "        x -= eta * gradf.apply(x);\n",
    "        results[i + 1] = x;\n",
    "    }\n",
    "    System.out.printf(\"epoch 10, x: %f\\n\", x);\n",
    "    return results;\n",
    "}\n",
    "\n",
    "float[] res = gd(0.2f);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对进行$x$优化的过程可以绘制如下。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "/* Saved in GradDescUtils.java */\n",
    "public void plotGD(float[] x, float[] y, float[] segment, Function<Float, Float> func, \n",
    "                                 int width, int height) {\n",
    "    // Function Line\n",
    "    ScatterTrace trace = ScatterTrace.builder(Functions.floatToDoubleArray(x), \n",
    "                                              Functions.floatToDoubleArray(y))\n",
    "        .mode(ScatterTrace.Mode.LINE)\n",
    "        .build();\n",
    "\n",
    "    // GD Line\n",
    "    ScatterTrace trace2 = ScatterTrace.builder(Functions.floatToDoubleArray(segment), \n",
    "                                               Functions.floatToDoubleArray(Functions.callFunc(segment, func)))\n",
    "        .mode(ScatterTrace.Mode.LINE)\n",
    "        .build();\n",
    "    \n",
    "    // GD Points\n",
    "    ScatterTrace trace3 = ScatterTrace.builder(Functions.floatToDoubleArray(segment), \n",
    "                                               Functions.floatToDoubleArray(Functions.callFunc(segment, func)))\n",
    "        .build();\n",
    "\n",
    "    Layout layout = Layout.builder()\n",
    "        .height(height)\n",
    "        .width(width)\n",
    "        .showLegend(false)\n",
    "        .build();\n",
    "    \n",
    "    display(new Figure(layout, trace, trace2, trace3));\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "/* Saved in GradDescUtils.java */\n",
    "public void showTrace(float[] res) {\n",
    "    float n = 0;\n",
    "    for (int i = 0; i < res.length; i++) {\n",
    "        if (Math.abs(res[i]) > n) {\n",
    "            n = Math.abs(res[i]);\n",
    "        }\n",
    "    }\n",
    "    NDArray fLineND = manager.arange(-n, n, 0.01f);\n",
    "    float[] fLine = fLineND.toFloatArray();\n",
    "    plotGD(fLine, Functions.callFunc(fLine, f), res, f, 500, 400);\n",
    "}\n",
    "\n",
    "showTrace(res);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 学习率\n",
    ":label:`subsec_gd-learningrate`\n",
    "\n",
    "*学习率*（learning rate）决定目标函数能否收敛到局部最小值，以及何时收敛到最小值。\n",
    "学习率$\\eta$可由算法设计者设置。\n",
    "请注意，如果我们使用的学习率太小，将导致$x$的更新非常缓慢，需要更多的迭代。\n",
    "例如，考虑同一优化问题中$\\eta = 0.05$的进度。\n",
    "如下所示，尽管经过了10个步骤，我们仍然离最优解很远。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "showTrace(gd(0.05f));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "相反，如果我们使用过高的学习率，$\\left|\\eta f'(x)\\right|$对于一阶泰勒展开式可能太大。\n",
    "也就是说， :eqref:`gd-taylor`中的$\\mathcal{O}(\\eta^2 f'^2(x))$可能变得显著了。\n",
    "在这种情况下，$x$的迭代不能保证降低$f(x)$的值。\n",
    "例如，当学习率为$\\eta=1.1$时，$x$超出了最优解$x=0$并逐渐发散。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "showTrace(gd(1.1f));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 局部最小值\n",
    "\n",
    "为了演示非凸函数的梯度下降，考虑函数$f(x) = x \\cdot \\cos(cx)$，其中$c$为某常数。\n",
    "这个函数有无穷多个局部最小值。\n",
    "根据我们选择的学习率，我们最终可能只会得到许多解的一个。\n",
    "下面的例子说明了（不切实际的）高学习率如何导致较差的局部最小值。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "float c = (float)(0.15f * Math.PI);\n",
    "\n",
    "Function<Float, Float> f = x -> x * (float)Math.cos(c * x);\n",
    "\n",
    "Function<Float, Float> gradf = x -> (float)(Math.cos(c * x) - c * x * Math.sin(c * x));\n",
    "\n",
    "showTrace(gd(2));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多元梯度下降\n",
    "\n",
    "现在我们对单变量的情况有了更好的理解，让我们考虑一下$\\mathbf{x} = [x_1, x_2, \\ldots, x_d]^\\top$的情况。\n",
    "即目标函数$f: \\mathbb{R}^d \\to \\mathbb{R}$将向量映射成标量。\n",
    "相应地，它的梯度也是多元的：它是一个由$d$个偏导数组成的向量：\n",
    "\n",
    "$$\\nabla f(\\mathbf{x}) = \\bigg[\\frac{\\partial f(\\mathbf{x})}{\\partial x_1}, \\frac{\\partial f(\\mathbf{x})}{\\partial x_2}, \\ldots, \\frac{\\partial f(\\mathbf{x})}{\\partial x_d}\\bigg]^\\top.$$\n",
    "\n",
    "梯度中的每个偏导数元素$\\partial f(\\mathbf{x})/\\partial x_i$代表了当输入$x_i$时$f$在$\\mathbf{x}$处的变化率。\n",
    "和先前单变量的情况一样，我们可以对多变量函数使用相应的泰勒近似来思考。\n",
    "具体来说，\n",
    "\n",
    "$$f(\\mathbf{x} + \\boldsymbol{\\epsilon}) = f(\\mathbf{x}) + \\mathbf{\\boldsymbol{\\epsilon}}^\\top \\nabla f(\\mathbf{x}) + \\mathcal{O}(\\|\\boldsymbol{\\epsilon}\\|^2).$$\n",
    ":eqlabel:`gd-multi-taylor`\n",
    "\n",
    "换句话说，在$\\boldsymbol{\\epsilon}$的二阶项中，\n",
    "最陡下降的方向由负梯度$-\\nabla f(\\mathbf{x})$得出。\n",
    "选择合适的学习率$\\eta > 0$来生成典型的梯度下降算法：\n",
    "\n",
    "$$\\mathbf{x} \\leftarrow \\mathbf{x} - \\eta \\nabla f(\\mathbf{x}).$$\n",
    "\n",
    "这个算法在实践中的表现如何呢？\n",
    "我们构造一个目标函数$f(\\mathbf{x})=x_1^2+2x_2^2$，\n",
    "并有二维向量$\\mathbf{x} = [x_1, x_2]^\\top$作为输入，\n",
    "标量作为输出。\n",
    "梯度由$\\nabla f(\\mathbf{x}) = [2x_1, 4x_2]^\\top$给出。\n",
    "我们将从初始位置$[-5, -2]$通过梯度下降观察$\\mathbf{x}$的轨迹。\n",
    "\n",
    "我们还需要两个辅助函数：\n",
    "第一个是update函数，并将其应用于初始值20次；\n",
    "第二个函数会显示$\\mathbf{x}$的轨迹。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "/* Saved in GradDescUtils.java */\n",
    "public class Weights {\n",
    "    public float x1, x2;\n",
    "    public Weights(float x1, float x2) {\n",
    "        this.x1 = x1;\n",
    "        this.x2 = x2;\n",
    "    }\n",
    "}\n",
    "\n",
    "/* Saved in GradDescUtils.java */\n",
    "/* Optimize a 2D objective function with a customized trainer. */\n",
    "public ArrayList<Weights> train2d(Function<Float[], Float[]> trainer, int steps) {\n",
    "    // s1和s2是稍后将使用的内部状态变量\n",
    "    float x1 = -5f, x2 = -2f, s1 = 0f, s2 = 0f;\n",
    "    ArrayList<Weights> results = new ArrayList<>();\n",
    "    results.add(new Weights(x1, x2));\n",
    "    for (int i = 1; i < steps + 1; i++) {\n",
    "        Float[] step = trainer.apply(new Float[]{x1, x2, s1, s2});\n",
    "        x1 = step[0];\n",
    "        x2 = step[1];\n",
    "        s1 = step[2];\n",
    "        s2 = step[3];\n",
    "        results.add(new Weights(x1, x2));\n",
    "        System.out.printf(\"epoch %d, x1 %f, x2 %f\\n\", i, x1, x2);\n",
    "    }\n",
    "    return results;\n",
    "}\n",
    "\n",
    "import java.util.function.BiFunction;\n",
    "\n",
    "/* Saved in GradDescUtils.java */\n",
    "/* Show the trace of 2D variables during optimization. */        \n",
    "public void showTrace2d(BiFunction<Float, Float, Float> f, ArrayList<Weights> results) {\n",
    "    // TODO: add when tablesaw adds support for contour and meshgrids\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，我们观察学习率$\\eta = 0.1$时优化变量$\\mathbf{x}$的轨迹。\n",
    "可以看到，经过20步之后，$\\mathbf{x}$的值接近其位于$[0, 0]$的最小值。\n",
    "虽然进展相当顺利，但相当缓慢。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "float eta = 0.1f;\n",
    "\n",
    "BiFunction<Float, Float, Float> f = (x1, x2) -> x1 * x1 + 2 * x2 * x2; // Objective\n",
    "\n",
    "BiFunction<Float, Float, Float[]> gradf = (x1, x2) -> new Float[]{2 * x1, 4 * x2}; // Gradient\n",
    "\n",
    "Function<Float[], Float[]> gd = (state) -> {\n",
    "    Float x1 = state[0];\n",
    "    Float x2 = state[1];\n",
    "    \n",
    "    Float[] g = gradf.apply(x1, x2); // Compute Gradient\n",
    "    Float g1 = g[0];\n",
    "    Float g2 = g[1];\n",
    "    \n",
    "    return new Float[]{x1 - eta * g1, x2 - eta * g2, 0f, 0f}; // Update Variables\n",
    "};\n",
    "\n",
    "showTrace2d(f, train2d(gd, 20));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image.png](https://d2l-java-resources.s3.amazonaws.com/img/contour_gd.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 自适应方法\n",
    "\n",
    "正如我们在 :numref:`subsec_gd-learningrate`中所看到的，选择“恰到好处”的学习率$\\eta$是很棘手的。\n",
    "如果我们把它选得太小，就没有什么进展；如果太大，得到的解就会振荡，甚至可能发散。\n",
    "如果我们可以自动确定$\\eta$，或者完全不必选择学习率，会怎么样？\n",
    "除了考虑目标函数的值和梯度、还考虑它的曲率的二阶方法可以帮我们解决这个问题。\n",
    "虽然由于计算代价的原因，这些方法不能直接应用于深度学习，但它们为如何设计高级优化算法提供了有用的思维直觉，这些算法可以模拟下面概述的算法的许多理想特性。\n",
    "\n",
    "### 牛顿法\n",
    "\n",
    "回顾一些函数$f: \\mathbb{R}^d \\rightarrow \\mathbb{R}$的泰勒展开式，事实上我们可以把它写成\n",
    "\n",
    "$$f(\\mathbf{x} + \\boldsymbol{\\epsilon}) = f(\\mathbf{x}) + \\boldsymbol{\\epsilon}^\\top \\nabla f(\\mathbf{x}) + \\frac{1}{2} \\boldsymbol{\\epsilon}^\\top \\nabla^2 f(\\mathbf{x}) \\boldsymbol{\\epsilon} + \\mathcal{O}(\\|\\boldsymbol{\\epsilon}\\|^3).$$\n",
    ":eqlabel:`gd-hot-taylor`\n",
    "\n",
    "为了避免繁琐的符号，我们将$\\mathbf{H} \\stackrel{\\mathrm{def}}{=} \\nabla^2 f(\\mathbf{x})$定义为$f$的Hessian，是$d \\times d$矩阵。\n",
    "当$d$的值很小且问题很简单时，$\\mathbf{H}$很容易计算。\n",
    "但是对于深度神经网络而言，考虑到$\\mathbf{H}$可能非常大，\n",
    "$\\mathcal{O}(d^2)$个条目的存储代价会很高，\n",
    "此外通过反向传播进行计算可能雪上加霜。\n",
    "然而，我们姑且先忽略这些考量，看看会得到什么算法。\n",
    "\n",
    "毕竟，$f$的最小值满足$\\nabla f = 0$。\n",
    "遵循 :numref:`sec_calculus`中的微积分规则，\n",
    "通过取$\\boldsymbol{\\epsilon}$对 :eqref:`gd-hot-taylor`的导数，\n",
    "再忽略不重要的高阶项，我们便得到\n",
    "\n",
    "$$\\nabla f(\\mathbf{x}) + \\mathbf{H} \\boldsymbol{\\epsilon} = 0 \\text{ and hence }\n",
    "\\boldsymbol{\\epsilon} = -\\mathbf{H}^{-1} \\nabla f(\\mathbf{x}).$$\n",
    "\n",
    "也就是说，作为优化问题的一部分，我们需要将Hessian矩阵$\\mathbf{H}$求逆。\n",
    "\n",
    "举一个简单的例子，对于$f(x) = \\frac{1}{2} x^2$，我们有$\\nabla f(x) = x$和$\\mathbf{H} = 1$。\n",
    "因此，对于任何$x$，我们可以获得$\\epsilon = -x$。\n",
    "换言之，单单一步就足以完美地收敛，而无须任何调整。\n",
    "我们在这里比较幸运：泰勒展开式是确切的，因为$f(x+\\epsilon)= \\frac{1}{2} x^2 + \\epsilon x + \\frac{1}{2} \\epsilon^2$。\n",
    "\n",
    "让我们看看其他问题。\n",
    "给定一个凸双曲余弦函数$c$，其中$c$为某些常数，\n",
    "我们可以看到经过几次迭代后，得到了$x=0$处的全局最小值。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "float c = 0.5f;\n",
    "\n",
    "Function<Float, Float> f = x -> (float)Math.cosh(c * x); // Objective\n",
    "\n",
    "Function<Float, Float> gradf = x -> c * (float)Math.sinh(c * x); // Derivative\n",
    "\n",
    "Function<Float, Float> hessf = x -> c * c * (float)Math.cosh(c * x); // Hessian\n",
    "\n",
    "// Hide learning rate for now\n",
    "public float[] newton(float eta) {\n",
    "    float x = 10f;\n",
    "    float[] results = new float[11];\n",
    "    results[0] = x;\n",
    "    \n",
    "    for (int i = 0; i < 10; i++) {\n",
    "        x -= eta * gradf.apply(x) / hessf.apply(x);\n",
    "        results[i + 1] = x;\n",
    "    }\n",
    "    System.out.printf(\"epoch 10, x: %f\\n\", x);\n",
    "    return results;\n",
    "}\n",
    "\n",
    "showTrace(newton(1));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在让我们考虑一个非凸函数，比如$f(x) = x \\cos(c x)$，$c$为某些常数。\n",
    "请注意在牛顿法中，我们最终将除以Hessian。\n",
    "这意味着如果二阶导数是负的，$f$的值可能会趋于增加。\n",
    "这是这个算法的致命缺陷！\n",
    "让我们看看实践中会发生什么。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c = 0.15f * (float)Math.PI;\n",
    "\n",
    "Function<Float, Float> f = x -> x * (float)Math.cos(c * x);\n",
    "\n",
    "Function<Float, Float> gradf = x -> (float)(Math.cos(c * x) - c * x * Math.sin(c * x));\n",
    "\n",
    "Function<Float, Float> hessf = x -> (float)(-2 * c * Math.sin(c * x) - \n",
    "                                    x * c * c * Math.cos(c * x));\n",
    "                                    \n",
    "showTrace(newton(1));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这发生了惊人的错误。我们怎样才能修正它？\n",
    "一种方法是用取Hessian的绝对值来修正，另一个策略是重新引入学习率。\n",
    "这似乎违背了初衷，但不完全是——拥有二阶信息可以使我们在曲率较大时保持谨慎，而在目标函数较平坦时则采用较大的学习率。\n",
    "让我们看看在学习率稍小的情况下它是如何生效的，比如$\\eta = 0.5$。\n",
    "如我们所见，我们有了一个相当高效的算法。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "showTrace(newton(0.5f));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 收敛性分析\n",
    "\n",
    "在此，我们以三次可微的目标凸函数$f$为例，分析它的牛顿法收敛速度。\n",
    "假设它们的二阶导数不为零，即$f'' > 0$。\n",
    "\n",
    "用$x^{(k)}$表示$x$在第$k^\\mathrm{th}$次迭代时的值，\n",
    "令$e^{(k)} \\stackrel{\\mathrm{def}}{=} x^{(k)} - x^*$表示$k^\\mathrm{th}$迭代时与最优性的距离。\n",
    "通过泰勒展开，我们得到条件$f'(x^*) = 0$可以写成\n",
    "\n",
    "$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \\frac{1}{2} (e^{(k)})^2 f'''(\\xi^{(k)}),$$\n",
    "\n",
    "这对某些$\\xi^{(k)} \\in [x^{(k)} - e^{(k)}, x^{(k)}]$成立。\n",
    "将上述展开除以$f''(x^{(k)})$得到\n",
    "\n",
    "$$e^{(k)} - \\frac{f'(x^{(k)})}{f''(x^{(k)})} = \\frac{1}{2} (e^{(k)})^2 \\frac{f'''(\\xi^{(k)})}{f''(x^{(k)})}.$$\n",
    "\n",
    "回想之前的方程$x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / f''(x^{(k)})$。\n",
    "插入这个更新方程，取两边的绝对值，我们得到\n",
    "\n",
    "$$\\left|e^{(k+1)}\\right| = \\frac{1}{2}(e^{(k)})^2 \\frac{\\left|f'''(\\xi^{(k)})\\right|}{f''(x^{(k)})}.$$\n",
    "\n",
    "因此，每当我们处于有界区域$\\left|f'''(\\xi^{(k)})\\right| / (2f''(x^{(k)})) \\leq c$，\n",
    "我们就有一个二次递减误差\n",
    "\n",
    "$$\\left|e^{(k+1)}\\right| \\leq c (e^{(k)})^2.$$\n",
    "\n",
    "另一方面，优化研究人员称之为“线性”收敛，而$\\left|e^{(k+1)}\\right| \\leq \\alpha \\left|e^{(k)}\\right|$这样的条件称为“恒定”收敛速度。\n",
    "请注意，我们无法估计整体收敛的速度，但是一旦我们接近极小值，收敛将变得非常快。\n",
    "另外，这种分析要求$f$在高阶导数上表现良好，即确保$f$在变化他的值方面没有任何“超常”的特性。\n",
    "\n",
    "### 预处理\n",
    "\n",
    "计算和存储完整的Hessian非常昂贵，而改善这个问题的一种方法是“预处理”。\n",
    "它回避了计算整个Hessian，而只计算“对角线”项，即如下的算法更新：\n",
    "\n",
    "$$\\mathbf{x} \\leftarrow \\mathbf{x} - \\eta \\mathrm{diag}(\\mathbf{H})^{-1} \\nabla f(\\mathbf{x}).$$\n",
    "\n",
    "虽然这不如完整的牛顿法精确，但它仍然比不使用要好得多。\n",
    "为什么预处理有效呢？\n",
    "假设一个变量以毫米表示高度，另一个变量以公里表示高度的情况。\n",
    "假设这两种自然尺度都以米为单位，那么我们的参数化就出现了严重的不匹配。\n",
    "幸运的是，使用预处理可以消除这种情况。\n",
    "梯度下降的有效预处理相当于为每个变量选择不同的学习率（矢量$\\mathbf{x}$的坐标）。\n",
    "我们将在后面一节看到，预处理推动了随机梯度下降优化算法的一些创新。\n",
    "\n",
    "### 梯度下降和线搜索\n",
    "\n",
    "梯度下降的一个关键问题是我们可能会超过目标或进展不足，\n",
    "解决这一问题的简单方法是结合使用线搜索和梯度下降。\n",
    "也就是说，我们使用$\\nabla f(\\mathbf{x})$给出的方向，\n",
    "然后进行二分搜索，以确定哪个学习率$\\eta$使$f(\\mathbf{x} - \\eta \\nabla f(\\mathbf{x}))$取最小值。\n",
    "\n",
    "有关分析和证明，此算法收敛迅速（请参见 :cite:`Boyd.Vandenberghe.2004`）。\n",
    "然而，对深度学习而言，这不太可行。\n",
    "因为线搜索的每一步都需要评估整个数据集上的目标函数，实现它的方式太昂贵了。\n",
    "\n",
    "## 小结\n",
    "\n",
    "* 学习率的大小很重要：学习率太大会使模型发散，学习率太小会没有进展。\n",
    "* 梯度下降会可能陷入局部极小值，而得不到全局最小值。\n",
    "* 在高维模型中，调整学习率是很复杂的。\n",
    "* 预处理有助于调节比例。\n",
    "* 牛顿法在凸问题中一旦开始正常工作，速度就会快得多。\n",
    "* 对于非凸问题，不要不作任何调整就使用牛顿法。\n",
    "\n",
    "## 练习\n",
    "\n",
    "1. 用不同的学习率和目标函数进行梯度下降实验。\n",
    "1. 在区间$[a, b]$中实现线搜索以最小化凸函数。\n",
    "    1. 你是否需要导数来进行二分搜索，即决定选择$[a, (a+b)/2]$还是$[(a+b)/2, b]$。\n",
    "    1. 算法的收敛速度有多快？\n",
    "    1. 实现该算法，并将其应用于求$\\log (\\exp(x) + \\exp(-2x -3))$的最小值。\n",
    "1. 设计一个定义在$\\mathbb{R}^2$上的目标函数，它的梯度下降非常缓慢。提示：不同坐标的缩放方式不同。\n",
    "1. 使用预处理实现牛顿方法的轻量级版本：\n",
    "    1. 使用对角Hessian作为预条件。\n",
    "    1. 使用它的绝对值，而不是实际值（可能有符号）。\n",
    "    1. 将此应用于上述问题。\n",
    "1. 将上述算法应用于多个目标函数（凸或非凸）。如果你把坐标旋转$45$度会怎么样？\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Java",
   "language": "java",
   "name": "java"
  },
  "language_info": {
   "codemirror_mode": "java",
   "file_extension": ".jshell",
   "mimetype": "text/x-java-source",
   "name": "Java",
   "pygments_lexer": "java",
   "version": "14.0.2+12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
